wd <- "/home/victor/projects/forecastflows/analyses"
set.seed(20012029)
library(ggplot2)
lpi_subset <- read.table(file.path(wd, "jdavies/African_mammals.txt"))
lpi_subset <- lpi_subset[, c("ID", "Genus", "Species", "Units", paste0("X", 1950:2020))]
lpi_subset$genus_species <- paste(lpi_subset$Genus, lpi_subset$Species)
lpi_subset_rsh <-
reshape(lpi_subset, direction = "long", idvar = "ID",
varying = paste0("X", 1950:2020), v.names = c("value"), timevar = "year", times = 1950:2020)
lpi_subset_rsh <- lpi_subset_rsh[lpi_subset_rsh$value != "NULL",]A sample workflow
Hi Victor, it’s Lizzie. I will just put my notes to you in italics. You really should remove them. In general, I suggest you focus on getting some code/plots and basic text done that you then share next week with everyone.
To show our workflow in action, we examine a subset of the Living Planet Index (LPI, download at: https://www.livingplanetindex.org/data_portal. This download should include a csv file, here: LPD2022_public.csv, that we use below). For more information on the LPI, check out the main text and www.livingplanetindex.org/.
Note: data from Jonathan comes from LivingPlanetIndexDatabase_2025-02-01_19.19.51.
1 Step 1: Research question
I started this, but it would be perfect for JD to fix/add to, so I did not finish it. (As mentioned above, I suggest you focus on getting some code/plots and basic text done that you then share next week with everyone.)
Here we need to focus on exactly what question we are trying to answer, which will guide us in the next steps. This step is sometimes skipped over or moved through so quickly that the exact question is not clear, but is critical for developing a useful model. We might start with: What is the global trend for biodiversity in vetebrate species? This is rather broad and might mean we should develop a model that will yield one global estimate. In thinking about this, we might realize we don’t want just that, we actually want to understand the variation across different species and perhaps populations and also a global estimate. So we might refine the question to:
What is global trend in vertebrate species population numbers, both overall and how does it vary acrosss species and populations?
2 Step 2: Model building
Next we need to think about how we would build a model to address this question, with the goal to build the model and evaluate it using simulated data. There is a lot of demographic models in population ecology, so we might revisit those and consider a discrete time model based on population size and stepping through each year of data. We might also start with a simple linear model, where populations can go up, down or do nothing. That gets us started, but now we need to consider how we will model species and populations. Populations usually are nested in species in ecological models, so we might consider that to start. Our resulting model might then be:
y_i ~ normal(mu, sigma)
mu = alpha[sp[pop]] + beta[sp[pop]]*year
Need to finish the notation as I always have to look up the nesting notation.
3 Step 3: Model evaluation
Now we need to simulate data to test our model. As we start to think about how to do this, we may become overwhelmed with the scale of the question we started with. We likely would not feel immediately sure how to simulate for all different vertebrate species.
3.1 Stepping backward…
Here we would likely feedback one to two steps. We could step back to Step 2 and consider adding a phylogeny or other structure from which we could then simulate data from, or we could step back to Step 1 and narrow the question (before potentially building it back up.) We believe the latter approach is often better as working on a small scale first can point out more fundamental problems versus starting big and then spending much longer to identify fundamental problems.
Obviously we can charge forward with the full model if you want, I just thought this would be (a) easier for us, (b) way better to show as it’s how you should do it and (c) related to a, I suspect JD can be more help, though any country with a good amount of data could work. So here we refine our question to:
What is trend in mammal species population numbers in Africa, both overall and how does it vary acrosss species and populations? Note: we will actually focus on African mammals with 10+ populations with 5+ observations per population.
3.2 Simulating data
Now we start to simulate! Sorry, I did not have time to do this, but I would simulate from the above model or some other simple one you come up with. I would simulate really NICE data, not realistic data.
nspecies <- 9
npops_persp <- round(runif(nspecies, 2, 15)) # no. of different populations per species
npops <- sum(npops_persp) # total no. of different populations
nobs_perpop <- round(runif(npops, 10, 40)) # no. of different observations per population
spid_perpop <- rep(1:nspecies, times = npops_persp)
spid_perobs <- rep(spid_perpop, times = nobs_perpop)
popid_perobs <- rep(1:npops, times = nobs_perpop)
years <- c()
for(np in nobs_perpop){
years_p <- sample(1950:2020, size = np, replace = FALSE)
years <- c(years, years_p)
}
# nested intercepts
mu_species_alpha <- rnorm(n = nspecies, mean = log(200), sd = log(5))
sig_species_alpha <- rnorm(n = nspecies, mean = 5, sd = 5)
mu_pop_alpha <- rnorm(n = npops, mean = mu_species_alpha[spid_perpop])
sig_pop_alpha <- abs(rnorm(n = npops))
alpha <- rnorm(n = npops, mean = mu_pop_alpha, sd = sig_pop_alpha)
# nested slopes
mu_species_beta <- rnorm(n = nspecies, mean = 0, sd = 0.01)
sig_species_beta <- rnorm(n = nspecies, mean = 0, sd = 0.05)
mu_pop_beta <- rnorm(n = npops, mean = mu_species_beta[spid_perpop], sd = 0.05)
sig_pop_beta <- abs(rnorm(n = npops, mean = 0, sd = 0.01))
beta <- rnorm(n = npops, mean = mu_pop_beta, sd = sig_pop_beta)
# save simulated intercepts and slopes
pops_fit <- data.frame()
pid <- 1
for(sp in 1:nspecies){
for(p in 1:npops_persp[sp]){
pops_fit <- rbind(pops_fit, data.frame(species = sp, pop = pid))
pid <- pid + 1
}
}
pops_fit$int <- mu_pop_alpha
pops_fit$slp <- mu_pop_beta
sigma_obs <- abs(rnorm(1, mean = 0.25, sd = 0.25))
y <- c()
for(p in 1:npops){
nobs_p <- nobs_perpop[p]
alpha_p <- alpha[p]
beta_p <- beta[p]
years_p <- years[popid_perobs == p]
yhat_p <- alpha_p + beta_p * (years_p - 1980)
# yhat_p <- alpha_p
y_p <- exp(rnorm(n = nobs_p, mean = yhat_p, sd = sigma_obs))
y <- c(y, y_p)
}
dataplot <- data.frame(
y,
year = years - 1980,
species = as.character(spid_perobs),
population = paste0(spid_perobs,popid_perobs)
)
dataplot$logy <- log(dataplot$y)
ggplot(data = dataplot) +
geom_line(aes(x = year+1980, y = y,
group = paste0(species, population),
color = species)) +
theme_bw() +
labs(x = 'Year', y = 'Simulated counts', colour = 'Species')ggplot(data = dataplot) +
geom_line(aes(x = year+1980, y = logy,
group = paste0(species, population),
color = species)) +
theme_bw() +
labs(x = 'Year', y = 'Simulated counts (log-scale)', colour = 'Species')3.3 Fit model on simulated data!
nestlm <- lm(logy ~ species/population + year * species/population, data = dataplot)
coefs <- summary(nestlm)$coefficients3.3.1 Retrodictive check with model fitted on simulated data
retrodata <- data.frame(obs = dataplot$logy, pred = predict(nestlm),
year = dataplot$year,
species = dataplot$species, id = dataplot$population)
ggplot(data = retrodata) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
facet_wrap(~ paste0("Species ", species), scales = "free") +
geom_point(aes(x = obs, y = pred, group = id, color = as.character(id)),
size = 0.5) +
theme_bw() + theme(legend.position = 'none') +
labs(x = 'Simulated counts (log-scale)', y = 'Estimated counts (log-scale)')ggplot(data = retrodata) +
facet_wrap(~ paste0("Species ", species), scales = "free") +
geom_point(aes(x = year, y = obs, group = id, color = as.character(id)),
size = 1) +
geom_line(aes(x = year, y = pred, group = id, color = as.character(id))) +
theme_bw() + theme(legend.position = 'none') +
labs(x = 'Year', y = 'Counts (log-scale)')ggplot(data = retrodata) +
facet_wrap(~ paste0("Species ", species), scales = "free") +
geom_point(aes(x = year, y = exp(obs), group = id, color = as.character(id)),
size = 1) +
geom_line(aes(x = year, y = exp(pred), group = id, color = as.character(id))) +
theme_bw() + theme(legend.position = 'none') +
labs(x = 'Year', y = 'Counts')3.3.2 Inference with model fitted on simulated data
The model does a fair job at recovering population-level parameters.
em <- data.frame(emmeans::emmeans(nestlm, ~ species/population))NOTE: A nesting structure was detected in the fitted model:
population %in% species
NOTE: Results may be misleading due to involvement in interactions
pops_fit$int_fit <- em$emmean
pops_fit$int_fit.lcl <- em$lower.CL
pops_fit$int_fit.ucl <- em$upper.CL
em <- data.frame(emmeans::emtrends(nestlm, ~ species/population, var = 'year'))NOTE: A nesting structure was detected in the fitted model:
population %in% species
pops_fit$slp_fit <- em$year.trend
pops_fit$slp_fit.lcl <- em$lower.CL
pops_fit$slp_fit.ucl <- em$upper.CL
ggplot(data = pops_fit) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
geom_pointrange(aes(ymin = int_fit.lcl, ymax = int_fit.ucl, x = int, y = int_fit, color = as.character(species))) +
theme_bw() +
labs(x = 'Simulated intercepts', y = 'Estimated intercepts', colour = 'Species')ggplot(data = pops_fit) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
geom_pointrange(aes(ymin = slp_fit.lcl, ymax = slp_fit.ucl, x = slp, y = slp_fit, color = as.character(species))) +
theme_bw() +
labs(x = 'Simulated slopes', y = 'Estimated slopes', colour = 'Species')4 Step 4: Fit model to empirical data
lpi_subset <- read.table("/home/victor/projects/forecastflows/analyses/jdavies/African_mammals.txt")
lpi_subset <- lpi_subset[, c("ID", "Genus", "Species", "Units", paste0("X", 1950:2020))]
lpi_subset$genus_species <- paste(lpi_subset$Genus, lpi_subset$Species)
lpi_subset_rsh <-
reshape(lpi_subset, direction = "long", idvar = "ID",
varying = paste0("X", 1950:2020), v.names = c("value"), timevar = "year", times = 1950:2020)
lpi_subset_rsh <- lpi_subset_rsh[lpi_subset_rsh$value != "NULL",]
lpi_subset_rsh$value <- as.numeric(lpi_subset_rsh$value)
units_to_keep <- c("Estimate - entire population", "Estimate - entire population (August)",
"Individual", "individuals", "Individuals", "Number of adults", "number of individuals",
"Number of individuals", "Number of Individuals", "Number of rhinos", "numbers of indviduals",
"population estimate", "Population estimate", "Population estimate (individuals)",
"Population estimates", "Total population size")
lpi_subset_rsh <- lpi_subset_rsh[lpi_subset_rsh$Units %in% units_to_keep,]
ggplot(data = lpi_subset_rsh) +
geom_line(aes(x = year, y = value,
group = paste0(genus_species, ID),
color = genus_species)) +
theme_bw()ggplot(data = lpi_subset_rsh[lpi_subset_rsh$value != 0,]) +
geom_line(aes(x = year, y = log(value),
group = paste0(genus_species, ID),
color = genus_species)) +
theme_bw()runmodel <- TRUE
lpi_subset_rsh <- lpi_subset_rsh[lpi_subset_rsh$value != 0,]
lpi_subset_rsh$year <- lpi_subset_rsh$year-1980
lpi_subset_rsh$logvalue <- log(lpi_subset_rsh$value)
lpi_subset_rsh$ID <- as.character(lpi_subset_rsh$ID)
if(runmodel){
nestlm <- lm(logvalue ~ genus_species/ID + year * genus_species/ID, data = lpi_subset_rsh)
saveRDS(nestlm, file = file.path(wd, "output/fit/nestmodel1.rds"))
}else{
nestlm <- readRDS(file.path(wd, "output/fit/nestmodel1.rds"))
}
coefs <- summary(nestlm)$coefficients
species <- unique(lpi_subset_rsh$genus_species)[order(unique(lpi_subset_rsh$genus_species))]
species_intercept <- c(coefs["(Intercept)", "Estimate"], sapply(species[2:length(species)], function(sp){coefs["(Intercept)", "Estimate"] + coefs[paste0("genus_species", sp), "Estimate"]}))
species_intercept_sd <- c(coefs["(Intercept)", "Std. Error"], sapply(species[2:length(species)], function(sp){coefs[paste0("genus_species", sp), "Std. Error"]}))
intercepts <- data.frame(est=species_intercept, est.sd=species_intercept_sd, species)
species_slope <- c(coefs["year", "Estimate"], sapply(species[2:length(species)], function(sp){coefs["year", "Estimate"] + coefs[paste0("year:genus_species", sp), "Estimate"]}))
species_slope_sd <- c(coefs["year", "Std. Error"], sapply(species[2:length(species)], function(sp){coefs[paste0("year:genus_species", sp), "Std. Error"]}))
slopes <- data.frame(est=species_slope, est.sd=species_slope_sd, species)Here we finally get to see how our model performs on the empirical data and are a step closer to answering our question, but we also start to notice some differences in our empirical data from our simulated data. In preparing our data for the model, we realize we have major gaps in sampling years and … something else?
Here, we might step back and adjust our model (re-doing Step 2-3 again), but in this case we go ahead.
See starterLPI.R for what I have so far with the data…
5 Step 5: Retrodictive check
retrodata <- data.frame(obs = lpi_subset_rsh$logvalue, pred = predict(nestlm),
year = lpi_subset_rsh$year, units = lpi_subset_rsh$Units,
species = lpi_subset_rsh$genus_species, id = lpi_subset_rsh$ID)
ggplot(data = retrodata) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
facet_wrap(~ species, scales = "free") +
geom_point(aes(x = obs, y = pred, group = id, color = as.character(id)),
size = 0.5) +
theme_bw() + theme(legend.position = 'none')ggplot(data = retrodata[retrodata$species == "Loxodonta africana", ]) +
facet_wrap(~ id, scales = "free") +
geom_point(aes(x = year, y = obs, group = id, color = id),
size = 1) +
geom_line(aes(x = year, y = pred, group = id, color = id)) +
theme_bw() + theme(legend.position = 'none')6 Feedbacks
At this point, we would have a lot of ideas of ways to improve this model that we would want to follow up on, including questioning our assumption of one sigma for all populations and potentially our linearity assumption. We also may want to add predictors for the slope such as … We would do this, returning to Steps 2-5 likely several times until we have a model that is good enough for our purposes here.
And then we would likely add in more countries and species….
I might also add a short paragraph about moving away from linear regression and towards models for species abudance (via mixtures I think? I suspect the articles I posted for class have some good ideas) and thinking more about linking to demography models.
And then we would realize we really don’t have great data for this question and we would …